Automatic Record Extraction for the World Wide Web

Authors

  • Yuan Kui Shen
  • David R. Karger
  • Arthur C. Smith
Abstract

As the amount of information on the World Wide Web grows, there is an increasing demand for software that can automatically process and extract information from web pages. Although the underlying data on most web pages is structured, we cannot automatically process these pages as structured data. We need robust technologies that can automatically understand human-readable formatting and induce the underlying data structures. This thesis focuses on a specific facet of this general unsupervised web information extraction problem. Structured data can appear in diverse forms, from lists to trees to semi-structured graphs. However, much of the information on the web appears in a flat format we call “records”. We describe a system, MURIEL, that uses supervised and unsupervised learning techniques to effectively extract records from web pages.

Thesis Supervisor: David R. Karger
Title: Professor

Similar resources

Extraction of Flat and Nested Data Records from Web Pages

This paper studies the problem of identification and extraction of flat and nested data records from a given web page. With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content like advertisements, navigation-panels, copyright n...

Unsupervised Structured Data Extraction from Template-generated Web Pages

This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle t...

Automatic Text Summarization in TIPSTER

Automatic Text Summarization was added as a major research thrust of the TIPSTER program during TIPSTER Phase III, 1996-1998. It is a natural extension of the previously supported research efforts in Information Extraction (IE) and Information Retrieval (IR). There is considerable interest in automatically producing summaries due, in large part, to the growth of the Internet and the World Wide ...

The Web-OEM approach to Web information extraction

The enormous amount of information available through the World Wide Web requires the development of effective tools for extracting and summarizing relevant data from Web sources. In this article we present a data model for representing Web documents and an associated SQL-like query language. Our framework provides an easy-to-use and well-formalized method for automatic generation of wrappers ex...

Functionality-Based Web Image Categorization

The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Identifying the functional categories of these images has important applications including information extraction, web mining, web page summarization and mobile access. This paper describes a study on the functional cate...

Publication date: 2006